Browser
Volume Number: 7
Issue Number: 8
Column Tag: C Workshop
A Fast Text Browser
By Allen Stenger, Gardena, CA
The Browser is a Think C program for viewing and searching (but not
modifying) text files. It is an example of using the Think Class Library to build a useful
application with only a modest effort. The Browser uses a fast search algorithm
invented by Boyer and Moore. Running on a Mac Plus, the program can open a 100
Kbyte text file in 3 seconds and can search it in 0.4 seconds.
Simplifying Assumptions
I often need to look up things in text-only files; e.g., to look up a topic in the
Macintosh Technical Notes index or to look at a program source file. Borland’s NotePad+
(part of SideKick) is very handy for this, but it is based on TextEdit. This makes it
slow, and incapable of handling large files (greater than 28K). I needed something fast
and capacious.
Because of the limited use to which I would put it, I could make several
simplifying assumptions. First, it would only read files, so I didn’t have to worry about
memory management on a changing file. Second, the files are already formatted into
lines, so I could ignore text-wrap and write each “paragraph” (i.e., string terminated
by a Return) on a single line (as long as I had a way to scroll horizontally through any
extra-long lines). Third, I never extracted any text from these files, so I didn’t have to
worry about highlighting selections or indeed about text selection at all. This requires a
slightly non-Macish way of doing repeated searches. I defined “Find” as “Find from the
top of the file” and “Find Again” as “Find from the last found position.”
Mapping into TCL
The Think Class Library takes care of most of the user interface details. I (being
so lazy) started with that -- specifically with the Starter application. (Symantec
includes a TinyEdit example based on Starter, but it uses TextEdit and so did not meet
my performance criteria.)
Although this is nominally “object-oriented programming,” objects play little
explicit role and the Browser is more an example of “software reuse.” The Starter
application already does much of what I wanted, and it was organized so that I could
easily make the desired changes. The one visible advantage of objects themselves in this
application is the ability to open multiple documents with no special programming -
each document is a separate object and does not interfere with the others.
(Easing software reuse may be the greatest contribution that object-oriented
programming will make: the structure of OOP programs makes it easy to modify them
without destroying already-working code in the process.)
The TCL is not a complete class library, such as is provided with Smalltalk
implementations. It is a “user interface” class library and deals with all the details of
the Macintosh user interface: menus, windows, scrolling, etc. You can define your own
class libraries built on or separate from TCL, through the OOP features of Think C.
If (like me) you have avoided relocatable blocks because of the perceived
difficulties and if (like me) you make lots of mistakes using them, you will have a lot of
trouble with TCL. All the instance variables are in a relocatable block, and you have to
be extra-careful either to lock the block (the handle’s name is “self”) before doing
anything that might move memory or else copy the variables you need into local,
non-relocatable variables (the recommended technique, called “shadowing” by
Symantec). Realize that memory may be moved not just through obvious trap calls such
as NewHandle but also by calling a subroutine (if the subroutine is in another,
non-loaded segment and LoadSeg is called to bring it in). This makes the shadowing of
string variables a little tricky, since the obvious way to copy to a local variable is to
use strcpy -- a subroutine call. This implies that in general you should not define
function parameters to be strings.
Shadowing is the clumsiest and most error-prone aspect of the Think object
implementation. Unfortunately I can’t think of any better way to do it, without giving
up the memory-management benefits of relocatable blocks.
The Browser, like the Starter, defines subclasses of the three classes
CApplication, CDocument, and CPanorama. The center of attention is CBrowserDoc, the
subclass of CDocument. It handles everything involving creating and destroying
documents and searching. CBrowserPane, a subclass of CPanorama (not of CPane as the
name suggests), has the relatively easy job of updating the window when something
changes in it. The TCL takes care of window sizing, scroll bar manipulation, etc. and
sends itsMainPane (the single instance of CBrowserPane for this document) a message
to update the window. The message is defined in “Panorama” coordinates, which are
both the coordinates relative to the document itself and the QuickDraw local coordinates.
This makes it easy for itsMainPane to figure out what part of the document needs to be
rewritten and where it goes in the window. CBrowserApp defines the application itself,
and deals mostly with opening documents. Browser is the main program, and is
identical with Starter except the names; it creates an instance of CBrowserApp and
starts its running.
The TCL translates menu selections into “commands.” A command passes through
a “chain of command” of objects until it reaches an object that knows how to process it.
Every object in the chain of command is an instance of CBureaucrat and as such has an
instance variable defining its “supervisor”; if the object cannot process a command it
passes it explicitly to its supervisor. For the Browser the chain starts at the instance
of CBrowserPane, chains upward to the instance of CBrowserDoc, and ends with the
instance of CBrowserApp. If no object in the chain can process a command, it is lost.
(You can simulate the effect of a menu selection by calling DoCommand with the desired
command; CBrowserApp::StartUpAction does this to open a file when the application
begins running. You also can invent your own commands, not tied to a menu, if for some
reason you want to send controls along the chain of command.)
The processing within an object also goes through a chain, but this one is a
“class inheritance chain.” For example, the document object is an instance of
CBrowserDoc, which has a DoCommand method that handles Find and Find Again. Any
other commands are passed explicitly (by a call to inherited::DoCommand) to its
superclass, CDocument, which handles commands dealing with open documents (e.g.,
Close). CDocument passes any leftover commands explicitly to its superclass,
CDirector, which then passes any remaining commands explicitly to its superclass,
CBureaucrat. This is the end of the line for an instance of CBureaucrat: the DoCommand
method in CBureaucrat passes all commands reaching it explicitly up to the object’s
supervisor.
Another example is the pane object. It is an instance of CBrowserPane, which has
no DoCommand method. Its superclass, CPanorama, also has no DoCommand method, so
the object eventually inherits its DoCommand from CBureaucrat, which always sends
the command to the supervisor.
Search Algorithm
This program uses a string-search algorithm invented by Robert S. Boyer and J
Strother Moore. The algorithm itself is simple, but it was derived by some very clever
reasoning.
Suppose we want to find the string onion in a file that begins “Colored cartoons of
giant onions”. The common approach would be to “align” the search string and the
file string and compare left-to-right until a mismatch is found, then “shift” the
search string right one and repeat. So we would start with
onion
Colored cartoons of giant onions
and observe that o and C don’t match. We would shift onion right one to get
onion
Colored cartoons of giant onions
where now the two os match but then n and l don’t match, so we shift again. This
continues until we have shifted all the way to onions, one position at a time.
Boyer and Moore observed that we can speed up the search by extracting some
more information from the failures. Their first innovation is to compare right-to-left,
so we would be matching the final n of onion against the r of Colored, which again fails.
So far this is no improvement, but we can observe that not only does n not match r, but
there is no r anywhere in the search string, so we can immediately shift it right by its
length rather than by one, bypassing all the hopeless intermediate searches.
onion
Colored cartoons of giant onions
Repeating, we see that a of cartoons occurs nowhere in onion, so we can shift
right another 5 characters.
onion
Colored cartoons of giant onions
If a had occurred in the search string, we could not have shifted by the search
string’s length, but we could shift to align the rightmost occurrence of a with its
occurrence in the file string. This process is the “inner loop” of the Boyer-Moore
search algorithm. The decision whether a character occurs in the search string, and if
so its rightmost position in the search string, is made by a lookup table (called delta0)
prepared at the beginning of the search.
Continuing, we see this time that there is a match of the two ns, so we move
(left) to the next characters, which are two os and also match. The next comparison
fails (i against o). Our inner loop method does not help here; it suggests that we should
shift the search string left one to align second o of onion with the the first o of cartoons,
but we already know there are no matches to the left of the current position. What can
we do?
One method that will always work is merely to shift the search string another
position to the right (we know there is no match at the current position or to the left).
But Boyer and Moore have added a further refinement to wring even more information
out of the failed matches. In the present case we know that the position being examined
in the file string has on preceded by something other than i. If the search string also
has on preceded by something other than i, we should shift the search string right to
align this with its occurrence in the file string. For most search strings there would be
no reoccurrence, so we could shift by the length of the search string; but here on does
indeed reoccur, preceded by nothing, so we shift to align the first on of onion:
onion
Colored cartoons of giant onions
The amount of shift is also pre-calculated for each trailing string of the search
string and is stored in the lookup table delta2. The location of the reoccurrence is called
the “rightmost plausible reoccurrence (RPR).” For example, the RPR of on is the
first on of onion, but the trailing string n has no RPR since there is no n preceded by
something other than o. If there is no RPR the delta2 table commands a shift by the
length of the search string.
We can continue the search as before. The o of of does not match the n of onion,
and we can shift right one to align the os.
onion
Colored cartoons of giant onions
Now the n and the f do not match, and there are no fs in onion, so we shift right 5
to get
onion
Colored cartoons of giant onions
Now the ns match but the o and the a don’t, and the RPR rule tells us we can again
shift 5.
onion
Colored cartoons of giant onions
Now the n and the i don’t match, and we shift 2 to align the is.
onion
Colored cartoons of giant onions
This is finally the match we are looking for, and the algorithm merely compares
character-by-character to confirm this.
The original Boyer-Moore algorithm deals only with exact (case-sensitive)
matches, but in most Macintosh work we want case-insensitive matches. The algorithm
is easily modified for this: construct delta0 and delta2 by considering uppercase and
lowercase the same, then do the character-by-character compares as case-insensitive.
The speed of the lookups is unaffected by this, and they take nearly all the running time
of the algorithm. The algorithm does slow down somewhat because there are now more
false partial matches than if we had stuck to case-sensitive, but even so it is nearly as
fast as the case-sensitive version.
Notes
The Think C debugger gives no visibility into the heap, and so is not much help in
finding heap problems. Jasik Design’s The Debugger is excellent for solving heap
problems. Turn on Heap Check to check heap corruption; you also may step
through the program (e.g., by every Memory Manager trap), dumping the heap at
each stop to check incorrect types or sizes. The Debugger also has an excellent
interface into the Think C project file that will show you the C source alone or
interspersed with the assembly language code. This lets you see where you are in
the program better than MacsBug does. It is also good for examining the quality of
the generated code - I used it for tuning Search.c and LineCount.c.
• If you get mysterious errors with the Think C debugger on but not with it off,
you are almost certainly the victim of heap problems and you should check all
your instance variables.
• One drawback of the TCL is that it develops only applications. I would have liked
the Browser to be a desk accessory (like NotePad+), but I could not find a
convenient way to do this. It seems that it would not be hard to alter the TCL to
produce DAs; the biggest changes would be to change the resource number
references from constants defined in Constants.h to calculated values, and to feed
events into CSwitchboard::ProcessEvent from the DA’s control routine rather than
let it pull them from the event queue itself. I did not want to alter the TCL, and I
could not see how to do this just by overriding existing classes. If DAs do not
become extinct with the advent of System 7.0, perhaps Symantec could put these
changes in an upgrade to the TCL.
• Each object disposes of itself so be sure to do all necessary cleanup for the class
before issuing inherited::Dispose(). This cleanup includes disposing of any
handles and any objects created by the object. For example, the pane
(itsMainPane) is disposed of by CBrowserDoc::Dispose().
• The “panorama units” are used to quantize the amount of scrolling, e.g. into
whole numbers of lines or characters. The Browser sets the units to one line
vertically and one character horizontally; the TCL scrolls only in multiples of
these amounts. The arguments to CPanorama::ScrollTo are in panorama units,
since it deals with scrolling. The arguments to the Draw function (defined in
CPane and overridden in CBrowserPane) are in QuickDraw local coordinates,
since it deals with a window and not specifically with scrolling.
• One user-interface detail not handled by the TCL is proper updating of a window
if part of it has been invalidated just before a scroll. This happens in the Browser
when doing a Find, since the Find dialog box comes up in front of the Browser
window and invalidates part of it. The Toolbox ScrollRect routine does not translate
the invalid region (since it doesn’t know about it), and the TCL generates a
Draw() call for the old invalid region (which is now in a different place in the
window). This is corrected by issuing a call to CWindow::Update() just after the
modal dialog to redraw the area under the dialog. If this is omitted, a white space
that is the scroll of the dialog box area is left in the window (unless this area is
scrolled totally off-screen).
For Further Reading
1. Robert S. Boyer and J Strother Moore, “A Fast String Searching Algorithm.”
Communications of the ACM, v. 20, no. 10, pp. 762-772 (October 1977). Describes
the Browser’s search algorithm. Also has much empirical and theoretical analysis of
the algorithm’s performance. (ACM = Association for Computing Machinery)
2. Donald E. Knuth, James H. Morris, and Vaughn R. Pratt, “Fast Pattern
Matching in Strings.” SIAM Journal of Computing, v. 6, no. 2, pp. 323-350 (June
1977). This describes another fast search algorithm, which was developed at the same
time as the Boyer-Moore algorithm and uses some of the same ideas. It compares
left-to-right rather than right-to-left: moving backward in the file string is messy if
the file is too large to fit in memory, and one consideration in developing the algorithm
was to avoid this. This is a fairly theoretical article, full of finite-state automata.
(SIAM = Society for Industrial and Applied Mathematics)